Abstract

Clustering is considered an unsupervised learning setting, in which the goal is to partition
a collection of data points into disjoint clusters. Often a bound k on the number of clusters is
given or assumed by the practitioner. Many versions of this problem have been defined, most
notably k-means and k-median.

An underlying problem with the unsupervised nature of clustering is that of determining a
similarity function. One approach for alleviating this difficulty is known as clustering with side
information, alternatively, semi-supervised clustering. Here, the practitioner incorporates side
information in the form of "must be clustered" or "must be separated" labels for data point
pairs. Each such piece of information comes at a "query cost" (often involving human response
solicitation). The collection of labels is then incorporated in the usual clustering algorithm as
either strict or soft constraints, possibly adding a pairwise constraint penalty function to the
chosen clustering objective.

Our work is mostly related to clustering with side information. We ask how to choose the
pairs of data points. Our analysis gives rise to a method provably better than simply choosing
them uniformly at random. Roughly speaking, we show that the distribution must be biased
so that more weight is placed on pairs incident to elements in smaller clusters in some optimal
solution. Of course we do not know the optimal solution, hence we do not know the bias. Using
the recently introduced method of ε-smooth relative regret approximations of Ailon, Begleiter
and Ezra, we can show an iterative process that improves both the clustering and the bias in
tandem. The process provably converges to the optimal solution faster (in terms of query cost)
than an algorithm selecting pairs uniformly.

1 Introduction 

Clustering of data is probably the most important problem in the theory of unsupervised learning.
In the most standard setting, the goal is to partition a collection of data points into related
groups. Virtually any large scale application using machine learning uses clustering either as a
data preprocessing step or as an end in itself.

In the most traditional sense, clustering is an unsupervised learning problem because the
solution is computed from the data itself, with no human labeling involved. There are many
versions, most notably k-means and k-median. The number k typically serves as an assumed upper
bound on the number of output clusters.

An underlying difficulty with the unsupervised nature of clustering is the fact that a similarity
(or distance) function between data points must be chosen by the practitioner as a preliminary
step. This is often not an easy task. Indeed, even if our dataset is readily embedded in
some natural vector (feature) space, we still carry the burden of choosing a normed
metric, and of applying some transformation (linear or otherwise) on the data for good measure.
Many approaches have been proposed to tackle this. In one approach, a metric learning algorithm
is executed as a preprocessing step in order to choose a suitable metric (from some family). This
approach is supervised, and uses distances between pairs of elements as (possibly noisy) labels.
The second approach is known as clustering with side information, alternatively, semi-supervised
clustering. This approach should be thought of as adding crutches to a lame distance function the
practitioner is too lazy to replace. Instead, she incorporates so-called side information in the form
of "must be clustered" or "must be separated" labels for data point pairs. Each such label comes
at a "query cost" (often involving human response solicitation). The collection of labels is then
incorporated in the chosen clustering algorithm as either strict constraints or soft ones, possibly
adding a pairwise constraint penalty function.



1.1 Previous Related Work 



Clustering with side information is a fairly new variant of clustering first described, independently,
by Demiriz et al. [1999] and Ben-Dor et al. [1999]. In the machine learning community it is also
widely known as semi-supervised clustering. There are a few alternatives for the form of feedback
providing the side information. The most natural ones are the single item labels [e.g., Demiriz et al.,
1999], and the pairwise constraints [e.g., Ben-Dor et al., 1999].



In our study, the side information is pairwise, comes at a cost and is treated frugally. In a
related yet different setting, similarity information for all (quadratically many) pairs is available
but is noisy. The combinatorial optimization theoretical problem of cleaning the noise is known
as correlation clustering [Bansal et al., 2002] or cluster editing [Shamir et al., 2004]. Constant
factor approximations are known for various versions of this problem [Charikar and Wirth, 2004,
Ailon et al., 2008]. A PTAS is known for a minimization version in which the number of clusters
is fixed [Giotis and Guruswami, 2006].

Roughly speaking, there are two main approaches for utilizing pairwise side information. In the
first approach, this information is used to fine tune or learn a distance function, which is then passed
on to any standard clustering algorithm. Examples include Cohn et al. [2000], Klein et al. [2002]
and Xing et al. [2002]. The second approach, which is the starting point to our work, modifies the
clustering algorithm's objective so as to incorporate the pairwise constraints. Basu [2005], in his
thesis, which also serves as a comprehensive survey, has championed this approach in conjunction
with k-means and hidden Markov random field clustering algorithms.



1.2 Our Contribution 

Our main motivation is reducing the number of pairwise similarity labels (query cost) required for
k-clustering data using an active learning approach. More precisely, we ask how to choose which
pairs of data points to query. Our analysis gives rise to a method provably better than simply
choosing them uniformly at random. Specifically, we show that the distribution from which we
draw pairs must be biased so that more weight is placed on pairs incident to elements
in smaller clusters in some optimal solution. Of course we do not know the optimal solution, let
alone the bias. Using the recently introduced method of ε-smooth relative regret approximations
(ε-SRRA) of Ailon et al. [2011] we can show an iterative process that improves both the clustering
and the bias in tandem. The process provably converges to the optimal solution faster (in terms
of query cost) than an algorithm uniformly selecting pairs. Optimality here is with respect to the
(complete) pairwise constraint penalty function.

In Section 2 we define our problem mathematically. We then present the ε-SRRA method of
Ailon et al. [2011] for the purpose of self containment in Section 3. Finally, we present our main
result in Section 4.



2 Notation and Definitions 



Let $V$ be a set of points of size $n$. Our goal is to partition $V$ into $k$ sets (clusters). There are
two sources of information guiding us in the process. One is unsupervised, possibly emerging from
features attached to each element $v \in V$ together with a chosen distance function. This information
is captured in a utility function such as k-means or k-medians. The other type is supervised, and
is encoded as an undirected graph $G = (V, E)$. An edge $(u,v) \in E$ corresponds to the constraint that
$u, v$ should be clustered together, and a nonedge $(u,v) \notin E$ corresponds to the converse. Each edge
or nonedge comes at a query cost. This means that $G$ exists only implicitly. We uncover the truth
value of the predicate "$(u,v) \in E$" for any chosen pair $u, v$ for a price. We also assume that $G$ is
riddled with human errors, hence it does not necessarily encode a perfect k-clustering of the data.
In what follows, we assume $G$ fixed.

A k-clustering $\mathcal{C} = \{C_1, \dots, C_k\}$ is a collection of $k$ disjoint (possibly empty) sets satisfying
$\bigcup_i C_i = V$. We use the notation $u \equiv_{\mathcal{C}} v$ if $u$ and $v$ are in the same cluster, and $u \not\equiv_{\mathcal{C}} v$ otherwise.

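The cost of $\mathcal{C}$ with respect to $G$ is defined as

$$ \mathrm{cost}(\mathcal{C}) = \sum_{(u,v) \in E} \mathbf{1}_{u \not\equiv_{\mathcal{C}} v} + \sum_{(u,v) \notin E} \mathbf{1}_{u \equiv_{\mathcal{C}} v} . $$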



Minimizing $\mathrm{cost}(\mathcal{C})$ over clusterings when $G$ is fully known is the problem known as correlation
clustering (in complete graphs). This problem was defined by Bansal et al. [2004] and has received
much attention since (e.g. Ailon et al. [2008], Charikar et al. [2005], Mitra and Samal [2009]).[1]
Mitra and Samal [2009] achieved a PTAS for this problem, namely, a polynomial time[2] algorithm
returning a k-clustering with cost at most $(1 + \varepsilon)$ that of the optimal. Their PTAS is not query
efficient: it requires knowledge of $G$ in its entirety. In this work we study the query complexity
required for achieving a $(1 + \varepsilon)$ approximation for cost. From a learning theoretical perspective, we
want to find the best k-clustering explaining $G$ using as few queries as possible into $G$.
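To make the cost function concrete, here is a minimal sketch in Python. It assumes a clustering is stored as a dictionary from points to cluster ids and that G is given explicitly as a set of unordered edges; both representations (and the names `cost`, `labels`, `edges`) are illustrative choices, not part of the paper. In the query model studied here G would not be available in full, so this only illustrates the definition.

```python
from itertools import combinations


def cost(labels, edges):
    """Count the unordered pairs on which a clustering disagrees with G.

    labels: dict mapping each point to its cluster id (hypothetical layout)
    edges:  set of frozenset({u, v}) pairs that G marks "cluster together"
    """
    total = 0
    for u, v in combinations(sorted(labels), 2):
        together = labels[u] == labels[v]
        is_edge = frozenset((u, v)) in edges
        # An edge split apart, or a nonedge placed together, costs one unit.
        if is_edge != together:
            total += 1
    return total


edges = {frozenset(p) for p in [(0, 1), (1, 2), (3, 4)]}
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
singletons = {v: v for v in range(5)}
print(cost(two_clusters, edges), cost(singletons, edges))
```

In this toy input the pair (0, 2) is a nonedge placed in one cluster, so the two-cluster solution pays one unit, while the all-singletons solution pays one unit per edge.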



3 The ε-Smooth Relative Regret Approximation (ε-SRRA) Method

Our search problem can be cast as a special case of the following more general learning problem.
Given some possibly noisy structure $h$ (e.g. a graph in our case), the goal is to find the best
explanation using a limited space $X$ of hypotheses (in our case k-clusterings). The goal is to
minimize a nonnegative cost, defined as the distance $d(f, h)$ between $f \in X$
and $h$. Assume also that the distance function $d$ between $X$ and $h$ is an extension of a metric on
$X$. Ailon et al. [2011] have recently shown the following general scheme for finding the best $f \in X$.
To explain this scheme, we need to define a notion of ε-smooth relative regret approximation.

1 The original problem definition did not limit the number of output clusters.
2 The polynomial degree depends on ε.

• Start with any solution $f_0 \in X$

• Set $t \leftarrow 1$

• Repeat until some stopping condition:

    - Set $f_t \leftarrow \arg\min_{g \in X} \hat{\Delta}_{f_{t-1}}(g)$, where $\hat{\Delta}_{f_{t-1}}$ is an ε-SRRA for $f_{t-1}$

    - Set $t \leftarrow t + 1$

Figure 1: Iterative algorithm using ε-SRRA

Given a solution $f \in X$ (call it the pivotal solution) and another solution $g \in X$, we define
$\Delta_f(g)$ to be $d(g,h) - d(f,h)$, namely, the difference between the cost of the solution $g$ and the
cost of the solution $f$. We call this the relative regret function with respect to $f$. Assume we have
oracle access to a function $\hat{\Delta}_f : X \to \mathbb{R}$ such that for all $g \in X$,

$$ |\hat{\Delta}_f(g) - \Delta_f(g)| \le \varepsilon\, d(f,g) . $$

If such an estimator function $\hat{\Delta}_f$ exists, we say that it is an ε-smooth relative regret approximation
(ε-SRRA) with respect to $f$. Ailon et al. [2011] show that if we have an ε-smooth relative regret
approximation function, then it is possible to obtain a $(1 + \varepsilon)$-approximation to the optimal solution
by repeating the iterative process presented in Figure 1. It is shown that this search algorithm
converges exponentially fast to a $(1 + \varepsilon)$-approximately optimal solution. More precisely, the following
is shown:



Theorem 3.1. [Ailon et al., 2011] Assume an input parameter $\varepsilon \in (0, 1/5)$ and an arbitrary initializer
$f_0 \in X$ for the algorithm in Figure 1. Denote $\mathrm{OPT} := \min_{f \in X} d(f, h)$. Then the following
holds for $f_t$ obtained in the algorithm in Figure 1, for all $t \ge 1$:

$$ d(f_t, h) \le (1 + 8\varepsilon)\,\bigl(1 + (5\varepsilon)^t\bigr)\,\mathrm{OPT} + (5\varepsilon)^t\, d(f_0, h) . \tag{3.1} $$
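The iterative scheme of Figure 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the construction used later for clusterings: the hypothesis space is the integers 0 to 10, the target h is the integer 7, the metric is the absolute difference, and the estimator is the exact relative regret perturbed by exactly ε·d(f, g) with an alternating sign, so it satisfies the ε-SRRA condition by design. The names `srra_search` and `make_toy_srra` are hypothetical, and the argmin is found by brute force.

```python
def srra_search(candidates, f0, make_srra, rounds=5):
    """Repeat f_t = argmin_g delta_hat_{f_{t-1}}(g) for a fixed number of rounds."""
    f = f0
    for _ in range(rounds):
        delta_hat = make_srra(f)            # estimator pivoted at the current solution
        f = min(candidates, key=delta_hat)  # brute-force argmin over the hypothesis space
    return f


H = 7  # toy target structure h


def dist(a, b):
    """Toy metric on the integers, extended to h because h is also an integer."""
    return abs(a - b)


def make_toy_srra(eps):
    """Return a factory building, for each pivot f, a perturbed relative regret."""
    def make(f):
        def delta_hat(g):
            exact = dist(g, H) - dist(f, H)
            sign = 1 if g % 2 == 0 else -1
            # The error has magnitude eps * d(f, g), the largest the SRRA condition allows.
            return exact + sign * eps * dist(f, g)
        return delta_hat
    return make


best = srra_search(range(11), 0, make_toy_srra(0.1))
print(best)
```

A fixed number of rounds stands in for the unspecified stopping condition. In this toy instance the search settles on the optimal hypothesis even though every estimate is off by the maximum permitted amount.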

There are two questions now: (1) How can we build $\hat{\Delta}_f$ efficiently? (2) How do we find $\arg\min_{g \in X} \hat{\Delta}_f(g)$?

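In the case of k-clusterings, the target structure $h$ is the graph $G$ and $X$ is the space of
k-clusterings over $V$. The metric $d$ over $X$ is taken to be

$$ d(\mathcal{C}, \mathcal{C}') = \frac{1}{2} \sum_{u,v} d_{u,v}(\mathcal{C}, \mathcal{C}') $$

where $d_{u,v}(\mathcal{C}, \mathcal{C}') = \mathbf{1}_{u \equiv_{\mathcal{C}'} v} \mathbf{1}_{u \not\equiv_{\mathcal{C}} v} + \mathbf{1}_{u \equiv_{\mathcal{C}} v} \mathbf{1}_{u \not\equiv_{\mathcal{C}'} v}$. By defining $d(\mathcal{C}, G) := \mathrm{cost}(\mathcal{C})$ we clearly extend $d$
to a metric over $X \cup \{G\}$.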
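As a small illustration of this metric, the following Python sketch counts the unordered pairs on which two clusterings disagree. Storing clusterings as dictionaries from points to cluster ids is an assumption made for the example, as is the function name `clustering_distance`.

```python
from itertools import combinations


def clustering_distance(a, b):
    """Number of unordered pairs {u, v} that one clustering joins and the other splits.

    a, b: dicts mapping the same set of points to cluster ids (hypothetical layout)
    """
    return sum(
        (a[u] == a[v]) != (b[u] == b[v])
        for u, v in combinations(sorted(a), 2)
    )


a = {0: 0, 1: 0, 2: 1, 3: 1}
b = {0: 0, 1: 1, 2: 1, 3: 1}
c = {0: 0, 1: 1, 2: 2, 3: 3}
print(clustering_distance(a, b), clustering_distance(a, c), clustering_distance(b, c))
```

The distance depends only on which pairs are co-clustered, so relabeling the cluster ids of either argument leaves it unchanged.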

4 ε-Smooth Regret Approximation for k-Correlation Clustering

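Denote $\mathrm{cost}_{u,v}(\mathcal{C}) = \mathbf{1}_{(u,v) \in E} \mathbf{1}_{u \not\equiv_{\mathcal{C}} v} + \mathbf{1}_{(u,v) \notin E} \mathbf{1}_{u \equiv_{\mathcal{C}} v}$, so that $\mathrm{cost}(\mathcal{C}) = \frac{1}{2} \sum_{u,v} \mathrm{cost}_{u,v}(\mathcal{C})$, where the sum ranges over ordered pairs. Now
consider another clustering $\mathcal{C}'$. We are interested in the change in cost incurred by replacing $\mathcal{C}$ by
$\mathcal{C}'$, in other words in the function $f$ defined as

$$ f(\mathcal{C}') = \mathrm{cost}(\mathcal{C}') - \mathrm{cost}(\mathcal{C}) . $$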

We would like to be able to compute an approximation $\hat{f}$ of $f$ by viewing only a sample of
edges in $G$. That is, we imagine that each edge query from $G$ costs us one unit, and we would
like to reduce that cost while sacrificing our accuracy as little as possible. We will refer to the
cost incurred by queries as the query complexity. Consider the following metric on the space of
clusterings:

$$ d(\mathcal{C}, \mathcal{C}') = \frac{1}{2} \sum_{u,v} d_{u,v}(\mathcal{C}, \mathcal{C}') $$

where $d_{u,v}(\mathcal{C}, \mathcal{C}') = \mathbf{1}_{u \equiv_{\mathcal{C}'} v} \mathbf{1}_{u \not\equiv_{\mathcal{C}} v} + \mathbf{1}_{u \equiv_{\mathcal{C}} v} \mathbf{1}_{u \not\equiv_{\mathcal{C}'} v}$. (The distance function simply counts the number
of unordered pairs on which $\mathcal{C}$ and $\mathcal{C}'$ disagree.) Before we define our sampling scheme, we
slightly reorganize the function $f$. Assume that $|C_1| \ge |C_2| \ge \cdots \ge |C_k|$. Denote $|C_i|$ by $n_i$. The
function $f$ will now be written as:

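$$ f(\mathcal{C}') = \sum_{i=1}^{k} \sum_{u \in C_i} \Bigl( \frac{1}{2} \sum_{v \in C_i} f_{u,v}(\mathcal{C}') + \sum_{j=i+1}^{k} \sum_{v \in C_j} f_{u,v}(\mathcal{C}') \Bigr) \tag{4.1} $$

where

$$ f_{u,v}(\mathcal{C}') = \mathrm{cost}_{u,v}(\mathcal{C}') - \mathrm{cost}_{u,v}(\mathcal{C}) . $$

Note that $f_{u,v}(\mathcal{C}') = 0$ whenever $\mathcal{C}$ and $\mathcal{C}'$ agree on the pair $u, v$. For each $i \in [k]$, let $f_i(\mathcal{C}')$
denote the sum running over $u \in C_i$ in (4.1), so that $f(\mathcal{C}') = \sum_{i=1}^{k} f_i(\mathcal{C}')$. Similarly, we now rewrite
$d(\mathcal{C}', \mathcal{C})$ as follows: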

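$$ d(\mathcal{C}', \mathcal{C}) = \sum_{i=1}^{k} \sum_{u \in C_i} \Bigl( \sum_{j=i+1}^{k} \sum_{v \in C_j} d_{u,v}(\mathcal{C}', \mathcal{C}) + \frac{1}{2} \sum_{v \in C_i} d_{u,v}(\mathcal{C}', \mathcal{C}) \Bigr) \tag{4.2} $$

and denote by $d_i(\mathcal{C}')$ the sum over $u \in C_i$ for $i$ fixed in the last expression, so that $d(\mathcal{C}', \mathcal{C}) = \sum_{i=1}^{k} d_i(\mathcal{C}')$.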

Our sampling scheme will be done as follows. Let $\varepsilon$ be an error tolerance parameter, which we
set below. For each cluster $C_i \in \mathcal{C}$ and for each element $u \in C_i$ we will draw $k - i + 1$ independent
samples $S_{ui}, S_{u(i+1)}, \dots, S_{uk}$ as follows. Each sample $S_{uj}$ is a subset of $C_j$ of size $q$ (to be defined
below), chosen uniformly with repetitions from $C_j$. We will take

$$ q = c_2 k^2 \log n / \varepsilon^4 , $$

where $\delta$ is a failure probability (to be used below), and $c_2$ is a universal constant.
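The query cost of this scheme can be tallied directly: an element of the i-th largest cluster draws k - i + 1 samples of size q, and each sampled pair costs one query. The Python sketch below does this count. The universal constant c_2 is not specified in the text, so the default `c2=1.0` is purely a placeholder assumption, as are the function names and the use of the natural logarithm.

```python
import math


def sample_size(n, k, eps, c2=1.0):
    """q = c2 * k^2 * log(n) / eps^4, rounded up (c2 = 1.0 is a placeholder)."""
    return math.ceil(c2 * k ** 2 * math.log(n) / eps ** 4)


def query_budget(cluster_sizes, eps, c2=1.0):
    """Total number of sampled pairs (with repetitions) drawn by the scheme."""
    sizes = sorted(cluster_sizes, reverse=True)  # |C_1| >= |C_2| >= ... >= |C_k|
    n, k = sum(sizes), len(sizes)
    q = sample_size(n, k, eps, c2)
    # With 0-based index i, an element of the i-th cluster draws k - i samples of size q.
    return sum(n_i * (k - i) * q for i, n_i in enumerate(sizes))


print(sample_size(6, 3, 1.0), query_budget([3, 2, 1], 1.0))
```

Since every element draws at most k samples, the budget is at most n·k·q. The sample size q does not depend on the size of the cluster being sampled, so pairs touching a small cluster are drawn with higher probability than under uniform sampling of pairs, which matches the bias described in the introduction.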
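Finally, we define our estimator $\hat{f}$ of $f$ to be:

$$ \hat{f}(\mathcal{C}') = \sum_{i=1}^{k} \sum_{u \in C_i} \frac{|C_i|}{2q} \sum_{v \in S_{ui}} f_{u,v}(\mathcal{C}') + \sum_{i=1}^{k} \sum_{u \in C_i} \sum_{j=i+1}^{k} \frac{|C_j|}{q} \sum_{v \in S_{uj}} f_{u,v}(\mathcal{C}') . $$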
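The estimator can be sketched in Python as follows. This is an illustrative reading of the sampling scheme rather than the authors' implementation: clusterings are dictionaries from points to cluster ids, the pivot clustering is additionally passed as a list of clusters sorted by non-increasing size, G is an explicit edge set, and each sample from C_j is rescaled by |C_j|/q (halved within a cluster) so that its expectation matches the corresponding term of (4.1). All function names are hypothetical.

```python
import random
from itertools import combinations


def pair_cost(u, v, labels, edges):
    """cost_{u,v}: 1 if the clustering disagrees with G on the pair, else 0."""
    return int((frozenset((u, v)) in edges) != (labels[u] == labels[v]))


def f_exact(C, C_new, edges):
    """f(C') = cost(C') - cost(C), summed over unordered pairs."""
    return sum(pair_cost(u, v, C_new, edges) - pair_cost(u, v, C, edges)
               for u, v in combinations(sorted(C), 2))


def f_estimate(clusters, C, C_new, edges, q, rng):
    """Sampled estimate of f(C'); `clusters` lists C's clusters, largest first."""
    def f_uv(u, v):
        if u == v:
            return 0
        return pair_cost(u, v, C_new, edges) - pair_cost(u, v, C, edges)

    est = 0.0
    for i, Ci in enumerate(clusters):
        for u in Ci:
            for j in range(i, len(clusters)):
                Cj = clusters[j]
                if not Cj:
                    continue
                sample = [rng.choice(Cj) for _ in range(q)]  # uniform, with repetitions
                weight = len(Cj) / q * (0.5 if j == i else 1.0)
                est += weight * sum(f_uv(u, v) for v in sample)
    return est


clusters = [[0, 1, 2], [3, 4], [5]]
C = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
C_new = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}
edges = {frozenset(p) for p in [(0, 1), (2, 3), (3, 4), (4, 5)]}
rng = random.Random(0)
runs = [f_estimate(clusters, C, C_new, edges, 3, rng) for _ in range(4000)]
print(f_exact(C, C_new, edges), sum(runs) / len(runs))
```

Averaging many independent estimates should land close to the exact value, which is the unbiasedness property. A single estimate with small q can be far off, which is why the analysis needs q to grow with k, log n and 1/ε.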
Clearly for each $\mathcal{C}'$ it holds that $\hat{f}(\mathcal{C}')$ is an unbiased estimator of $f(\mathcal{C}')$. We now analyze its
error. For each $i, j \in [k]$ let $C_{ij}$ denote $C_i \cap C'_j$. This captures exactly the set of elements in the
$i$'th cluster in $\mathcal{C}$ and the $j$'th cluster in $\mathcal{C}'$. The distance $d(\mathcal{C}, \mathcal{C}')$ can be written as follows:

$$ d(\mathcal{C}, \mathcal{C}') = \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} |C_{ij} \times (C_i \setminus C_{ij})| + \sum_{j=1}^{k} \sum_{1 \le i_1 < i_2 \le k} |C_{i_1 j} \times C_{i_2 j}| . \tag{4.3} $$



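We call each cartesian set product in (4.3) a distance contributing rectangle. Note that unless a
pair $(u,v)$ appears in one of the distance contributing rectangles, we have $f_{u,v}(\mathcal{C}') = 0$.
Hence we can decompose $f$ and $\hat{f}$ in correspondence with the distance contributing rectangles, as
follows:

$$ f(\mathcal{C}') = \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} F_{i,j}(\mathcal{C}') + \sum_{j=1}^{k} \sum_{1 \le i_1 < i_2 \le k} F_{i_1,i_2,j}(\mathcal{C}') \tag{4.4} $$

$$ \hat{f}(\mathcal{C}') = \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \hat{F}_{i,j}(\mathcal{C}') + \sum_{j=1}^{k} \sum_{1 \le i_1 < i_2 \le k} \hat{F}_{i_1,i_2,j}(\mathcal{C}') \tag{4.5} $$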



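where

$$ F_{i,j}(\mathcal{C}') = \sum_{u \in C_{ij}} \sum_{v \in C_i \setminus C_{ij}} f_{u,v}(\mathcal{C}') \tag{4.6} $$

$$ \hat{F}_{i,j}(\mathcal{C}') = \frac{|C_i|}{q} \sum_{u \in C_{ij}} \sum_{v \in S_{ui} \cap (C_i \setminus C_{ij})} f_{u,v}(\mathcal{C}') \tag{4.7} $$

$$ F_{i_1,i_2,j}(\mathcal{C}') = \sum_{u \in C_{i_1 j}} \sum_{v \in C_{i_2 j}} f_{u,v}(\mathcal{C}') \tag{4.8} $$

$$ \hat{F}_{i_1,i_2,j}(\mathcal{C}') = \frac{|C_{i_2}|}{q} \sum_{u \in C_{i_1 j}} \sum_{v \in S_{u i_2} \cap C_{i_2 j}} f_{u,v}(\mathcal{C}') \tag{4.9} $$

(Note that the $S_{uj}$'s are multisets, and the inner sums in (4.7) and (4.9) may count elements multiple
times.)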
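The rectangle decomposition of the distance can be checked numerically on small inputs. The Python sketch below computes the distance twice: once by counting disagreeing pairs directly, and once by summing the sizes of the distance contributing rectangles built from the intersections of a cluster of one clustering with a cluster of the other. Representing clusterings as lists of clusters and the function names are assumptions of the example.

```python
from itertools import combinations


def pair_distance(clusters, new_clusters):
    """Count unordered pairs joined by one clustering and split by the other."""
    a = {v: i for i, cl in enumerate(clusters) for v in cl}
    b = {v: j for j, cl in enumerate(new_clusters) for v in cl}
    return sum((a[u] == a[v]) != (b[u] == b[v])
               for u, v in combinations(sorted(a), 2))


def rectangle_distance(clusters, new_clusters):
    """Same distance via the rectangles C_ij x (C_i minus C_ij) and C_i1j x C_i2j."""
    cells = [[len(set(ci) & set(cj)) for cj in new_clusters] for ci in clusters]
    # Pairs inside one old cluster that the new clustering splits (ordered, so halve).
    within = sum(cells[i][j] * (len(clusters[i]) - cells[i][j])
                 for i in range(len(clusters))
                 for j in range(len(new_clusters)))
    # Pairs from two different old clusters that the new clustering joins.
    across = sum(cells[i1][j] * cells[i2][j]
                 for j in range(len(new_clusters))
                 for i1, i2 in combinations(range(len(clusters)), 2))
    return within // 2 + across


old = [[0, 1], [2, 3]]
new = [[0], [1, 2, 3]]
print(pair_distance(old, new), rectangle_distance(old, new))
```

The two counts agree because every disagreeing pair falls in exactly one rectangle, and the within-cluster rectangles see each such pair once in each order.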

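Lemma 4.1. With probability at least $1 - n^{-3}$, the following holds simultaneously for all
k-clusterings $\mathcal{C}'$ and all $i, j \in [k]$:

$$ |F_{i,j}(\mathcal{C}') - \hat{F}_{i,j}(\mathcal{C}')| \le \varepsilon \cdot |C_{ij} \times (C_i \setminus C_{ij})| . \tag{4.10} $$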

Proof. Given a k-clustering $\mathcal{C}' = \{C'_1, \dots, C'_k\}$, the predicate (4.10) (for a given $i, j$) depends only
on the set $C_{ij} = C_i \cap C'_j$. Given a subset $B \subseteq C_i$, we say that $\mathcal{C}'$ $(i,j)$-realizes $B$ if $C_{ij} = B$.

Now fix $i, j$ and $B \subseteq C_i$. Assume a k-clustering $\mathcal{C}'$ $(i,j)$-realizes $B$. Let $b = |B|$ and $c = |C_i|$.
Consider the random variable $\hat{F}_{i,j}(\mathcal{C}')$ (see (4.7)). Think of the sample $S_{ui}$ as a sequence
$S_{ui}(1), \dots, S_{ui}(q)$, where each $S_{ui}(s)$ is chosen uniformly at random from $C_i$ for $s = 1, \dots, q$. We
can now rewrite $\hat{F}_{i,j}(\mathcal{C}')$ as follows:

$$ \hat{F}_{i,j}(\mathcal{C}') = \frac{c}{q} \sum_{u \in B} \sum_{s=1}^{q} X(S_{ui}(s)) $$

where

$$ X(v) = \begin{cases} f_{u,v}(\mathcal{C}') & v \in C_i \setminus C_{ij} \\ 0 & \text{otherwise.} \end{cases} $$



For all $s = 1, \dots, q$ the random variable $X(S_{ui}(s))$ is bounded by 2 almost surely, and its moments
satisfy:

$$ E[X(S_{ui}(s))] = \frac{1}{c} \sum_{v \in (C_i \setminus C_{ij})} f_{u,v}(\mathcal{C}') , \qquad E[X(S_{ui}(s))^2] \le \frac{4(c - b)}{c} . \tag{4.11} $$

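From this we conclude using the Bernstein inequality that for any $t \le b(c - b)$ and some global constant $c_3 > 0$,

$$ \Pr\bigl[\,|\hat{F}_{i,j}(\mathcal{C}') - F_{i,j}(\mathcal{C}')| \ge t\,\bigr] \le \exp\Bigl\{ -\frac{c_3 t^2 q}{b\, c\, (c - b)} \Bigr\} . $$

Plugging in $t = \varepsilon b(c - b)$, we conclude

$$ \Pr\bigl[\,|\hat{F}_{i,j}(\mathcal{C}') - F_{i,j}(\mathcal{C}')| \ge \varepsilon b(c - b)\,\bigr] \le \exp\Bigl\{ -\frac{c_3 \varepsilon^2 q\, b (c - b)}{c} \Bigr\} . $$

Now note that the number of possible sets $B \subseteq C_i$ of size $b$ is at most $n^{\min\{b,\, c - b\}}$. Using a union
bound and recalling our choice of $q$, the lemma follows.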

□ 
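To get a feel for how the tail bound in this proof behaves, the sketch below evaluates the textbook two-sided Bernstein inequality for the sum of the b·q sampled variables. It is a numerical illustration under stated assumptions, not the constants used in the paper: the centered variables are taken to be bounded by 4 (since X is bounded by 2), and the second moment of X is bounded by 4(c - b)/c, which follows because X is bounded by 2 and is nonzero only when the sampled element lands in the c - b elements outside B. The function names are hypothetical.

```python
import math


def bernstein_tail(t, variance_sum, bound):
    """Two-sided Bernstein bound on Pr[|sum of centered variables| >= t].

    variance_sum: upper bound on the sum of the variances
    bound:        almost-sure bound on each centered variable
    """
    return 2.0 * math.exp(-(t * t / 2.0) / (variance_sum + bound * t / 3.0))


def lemma41_tail(b, c, q, eps):
    """Bound the chance that the estimate for a set B of size b errs by eps*b*(c-b)."""
    # The estimate is (c/q) times a sum of b*q samples, so an error of
    # eps*b*(c-b) in the estimate is an error of eps*b*(c-b)*q/c in the raw sum.
    t = eps * b * (c - b) * q / c
    variance_sum = b * q * 4.0 * (c - b) / c
    return bernstein_tail(t, variance_sum, 4.0)


for q in (100, 400, 1600):
    print(q, lemma41_tail(5, 20, q, 0.5))
```

The bound decays exponentially in q, which is what lets the union bound over all candidate sets B go through once q is a large enough multiple of log n.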

The Lemma can be easily proven using the Bernstein probability inequality. A bit more involved 
is the following: 

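Lemma 4.2. With probability at least $1 - n^{-3}$, the following holds simultaneously for all
k-clusterings $\mathcal{C}'$ and for all $i_1, i_2, j \in [k]$ with $i_1 < i_2$:

$$ |F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| \le \varepsilon \max\Bigl\{ |C_{i_1 j} \times C_{i_2 j}|,\; \frac{|C_{i_1 j} \times (C_{i_1} \setminus C_{i_1 j})|}{k},\; \frac{|C_{i_2 j} \times (C_{i_2} \setminus C_{i_2 j})|}{k} \Bigr\} \tag{4.12} $$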

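Proof. Given a k-clustering $\mathcal{C}' = \{C'_1, \dots, C'_k\}$, the predicate (4.12) (for a given $i_1, i_2, j$) depends
only on the sets $C_{i_1 j} = C_{i_1} \cap C'_j$ and $C_{i_2 j} = C_{i_2} \cap C'_j$. Given subsets $B_1 \subseteq C_{i_1}$ and $B_2 \subseteq C_{i_2}$, we
say that $\mathcal{C}'$ $(i_1, i_2, j)$-realizes $(B_1, B_2)$ if $C_{i_1 j} = B_1$ and $C_{i_2 j} = B_2$.

We now fix $i_1 < i_2$, $j$ and $B_1 \subseteq C_{i_1}$, $B_2 \subseteq C_{i_2}$. Assume a k-clustering $\mathcal{C}'$ $(i_1, i_2, j)$-realizes
$(B_1, B_2)$. For brevity, denote $b_\iota = |B_\iota|$ and $c_\iota = |C_{i_\iota}|$ for $\iota = 1, 2$. Using the Bernstein inequality as
before, we conclude that

$$ \Pr\bigl[\,|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| > t\,\bigr] \le \exp\Bigl\{ -\frac{c_3 t^2 q}{b_1 b_2 c_2} \Bigr\} \tag{4.13} $$

for any $t$ in the range $[0, b_1 b_2]$, for some global constant $c_3 > 0$. For $t$ in the range $(b_1 b_2, \infty)$,

$$ \Pr\bigl[\,|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| > t\,\bigr] \le \exp\Bigl\{ -\frac{c_5 t q}{c_2} \Bigr\} \tag{4.14} $$

for some global constant $c_5 > 0$.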
We consider the following three cases.

1. $b_1 b_2 \ge \max\{ b_1 (c_1 - b_1)/k,\; b_2 (c_2 - b_2)/k \}$. Hence, $b_1 \ge (c_2 - b_2)/k$ and $b_2 \ge (c_1 - b_1)/k$. In this
case, plugging in (4.13) we get

$$ \Pr\bigl[\,|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| > \varepsilon b_1 b_2\,\bigr] \le \exp\Bigl\{ -\frac{c_3 \varepsilon^2 b_1 b_2 q}{c_2} \Bigr\} . \tag{4.15} $$

Consider two subcases. (i) If $b_2 \ge c_2/2$ then the RHS of (4.15) is at most $\exp\{ -c_3 \varepsilon^2 b_1 q / 2 \}$.
The number of sets $B_1, B_2$ of sizes $b_1, b_2$ respectively is clearly at most $n^{b_1 + (c_2 - b_2)} \le n^{b_1 + k b_1}$.
Therefore, if $q = O(\varepsilon^{-2} k \log n)$, then with probability at least $1 - n^{-6}$ simultaneously for
all $B_1, B_2$ of sizes $b_1, b_2$ respectively and for all $\mathcal{C}'$ $(i_1, i_2, j)$-realizing $(B_1, B_2)$ we have that
$|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| \le \varepsilon b_1 b_2$. (ii) If $b_2 < c_2/2$ then by our assumption, $b_1 \ge c_2/(2k)$.
Hence the RHS of (4.15) is at most $\exp\{ -c_3 \varepsilon^2 b_2 q / (2k) \}$. The number of sets $B_1, B_2$ of sizes $b_1, b_2$
respectively is clearly at most $n^{(c_1 - b_1) + b_2} \le n^{b_2 (1 + k)}$. Therefore, if $q = O(\varepsilon^{-2} k^2 \log n)$, then
with probability at least $1 - n^{-6}$ simultaneously for all $B_1, B_2$ of sizes $b_1, b_2$ respectively and
for all $\mathcal{C}'$ $(i_1, i_2, j)$-realizing $(B_1, B_2)$ we have that $|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| \le \varepsilon b_1 b_2$.

2. $b_2 (c_2 - b_2)/k \ge \max\{ b_1 b_2,\; b_1 (c_1 - b_1)/k \}$. We consider two subcases.

(a) $\varepsilon b_2 (c_2 - b_2)/k \le b_1 b_2$. Using (4.13), we get

$$ \Pr\bigl[\,|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| > \varepsilon b_2 (c_2 - b_2)/k\,\bigr] \le \exp\Bigl\{ -\frac{c_3 \varepsilon^2 b_2 (c_2 - b_2)^2 q}{k^2 b_1 c_2} \Bigr\} \tag{4.16} $$

Again consider two subcases. (i) $b_2 \le c_2/2$. In this case we conclude from (4.16)

$$ \Pr\bigl[\,|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| > \varepsilon b_2 (c_2 - b_2)/k\,\bigr] \le \exp\Bigl\{ -\frac{c_3 \varepsilon^2 b_2 c_2 q}{4 k^2 b_1} \Bigr\} \tag{4.17} $$

Now note that by assumption

$$ b_1 \le (c_2 - b_2)/k \le c_2/k \le c_1/k . \tag{4.18} $$

Also by assumption, $b_1 \le b_2 (c_2 - b_2)/(c_1 - b_1) \le b_2 c_2/(c_1 - b_1)$. Plugging in (4.18), we
conclude that $b_1 \le b_2 c_2/(c_1 (1 - 1/k)) \le 2 b_2 c_2/c_1 \le 2 b_2$. From here we conclude that the
RHS of (4.17) is at most $\exp\{ -c_3 \varepsilon^2 c_2 q / (8 k^2) \}$. The number of sets $B_1, B_2$ of sizes $b_1, b_2$ respectively
is clearly at most $n^{b_1 + b_2} \le n^{2 b_2 + b_2} \le n^{3 c_2}$. Hence, if $q = O(\varepsilon^{-2} k^2 \log n)$ then with
probability at least $1 - n^{-6}$ simultaneously for all $B_1, B_2$ of sizes $b_1, b_2$ respectively and for
all $\mathcal{C}'$ $(i_1, i_2, j)$-realizing $(B_1, B_2)$ we have that $|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| \le \varepsilon b_2 (c_2 - b_2)/k$.

In the second subcase (ii) $b_2 > c_2/2$. The RHS of (4.16) is at most $\exp\{ -c_3 \varepsilon^2 (c_2 - b_2)^2 q / (2 k^2 b_1) \}$.
By our assumption, $(c_2 - b_2)/(k b_1) \ge 1$, hence this is at most $\exp\{ -c_3 \varepsilon^2 (c_2 - b_2) q / (2k) \}$.
The number of sets $B_1, B_2$ of sizes $b_1, b_2$ respectively is clearly at most $n^{b_1 + (c_2 - b_2)} \le
n^{(c_2 - b_2)/k + (c_2 - b_2)} \le n^{2 (c_2 - b_2)}$. Therefore, if $q = O(\varepsilon^{-2} k \log n)$, then with probability
at least $1 - n^{-6}$ simultaneously for all $B_1, B_2$ of sizes $b_1, b_2$ respectively and for all $\mathcal{C}'$
$(i_1, i_2, j)$-realizing $(B_1, B_2)$ we have that $|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| \le \varepsilon b_2 (c_2 - b_2)/k$.

(b) $\varepsilon b_2 (c_2 - b_2)/k > b_1 b_2$. We now use (4.14) to conclude

$$ \Pr\bigl[\,|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| > \varepsilon b_2 (c_2 - b_2)/k\,\bigr] \le \exp\Bigl\{ -\frac{c_5 \varepsilon b_2 (c_2 - b_2) q}{k c_2} \Bigr\} \tag{4.19} $$

We again consider the cases (i) $b_2 \le c_2/2$ and (ii) $b_2 > c_2/2$ as above. In (i), we get
that the RHS of (4.19) is at most $\exp\{ -c_5 \varepsilon b_2 q / (2k) \}$, that $b_1 \le 2 b_2$ and hence the number
of possibilities for $B_1, B_2$ is at most $n^{b_1 + b_2} \le n^{3 b_2}$. In (ii), we get that the RHS of
(4.19) is at most $\exp\{ -c_5 \varepsilon (c_2 - b_2) q / (2k) \}$, and the number of possibilities for $B_1, B_2$ is at
most $n^{2 (c_2 - b_2)}$. For both (i) and (ii) taking $q = O(\varepsilon^{-1} k \log n)$ ensures that with probability
at least $1 - n^{-6}$ simultaneously for all $B_1, B_2$ of sizes $b_1, b_2$ respectively and for all $\mathcal{C}'$
$(i_1, i_2, j)$-realizing $(B_1, B_2)$ we have that $|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| \le \varepsilon b_2 (c_2 - b_2)/k$.

3. $b_1 (c_1 - b_1)/k \ge \max\{ b_1 b_2,\; b_2 (c_2 - b_2)/k \}$. We consider two subcases.

• $\varepsilon b_1 (c_1 - b_1)/k \le b_1 b_2$. Using (4.13), we get

$$ \Pr\bigl[\,|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| > \varepsilon b_1 (c_1 - b_1)/k\,\bigr] \le \exp\Bigl\{ -\frac{c_3 \varepsilon^2 b_1 (c_1 - b_1)^2 q}{k^2 b_2 c_2} \Bigr\} . \tag{4.20} $$

As before, consider case (i) in which $b_2 \le c_2/2$ and (ii) in which $b_2 > c_2/2$. For case (i),
we notice that the RHS of (4.20) is at most $\exp\{ -c_3 \varepsilon^2 b_2 (c_2 - b_2)(c_1 - b_1) q / (k^2 b_2 c_2) \}$ (we used the fact
that $b_1 (c_1 - b_1) \ge b_2 (c_2 - b_2)$ by assumption). This is hence at most $\exp\{ -c_3 \varepsilon^2 (c_1 - b_1) q / (2 k^2) \}$.
The number of possibilities of $B_1, B_2$ of sizes $b_1, b_2$ is clearly at most $n^{(c_1 - b_1) + b_2} \le
n^{(c_1 - b_1) + (c_1 - b_1)/k} \le n^{2 (c_1 - b_1)}$. From this we conclude that $q = O(\varepsilon^{-2} k^2 \log n)$ suffices
for this case. For case (ii), we bound the RHS of (4.20) by $\exp\{ -c_3 \varepsilon^2 b_1 (c_1 - b_1)^2 q / (2 k^2 b_2^2) \}$.
Using the assumption that $(c_1 - b_1)/b_2 \ge k$, the latter expression is upper bounded by
$\exp\{ -c_3 \varepsilon^2 b_1 q / 2 \}$. Again by our assumptions,

$$ b_1 \ge b_2 (c_2 - b_2)/(c_1 - b_1) \ge (\varepsilon (c_1 - b_1)/k)(c_2 - b_2)/(c_1 - b_1) = \varepsilon (c_2 - b_2)/k . \tag{4.21} $$

The number of possibilities of $B_1, B_2$ of sizes $b_1, b_2$ is clearly at most $n^{b_1 + (c_2 - b_2)}$ which by
(4.21) is bounded by $n^{b_1 + k b_1/\varepsilon} \le n^{2 k b_1/\varepsilon}$. From this we conclude that $q = O(\varepsilon^{-3} k \log n)$
suffices for this case.

• $\varepsilon b_1 (c_1 - b_1)/k > b_1 b_2$. Using (4.14), we get

$$ \Pr\bigl[\,|F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| > \varepsilon b_1 (c_1 - b_1)/k\,\bigr] \le \exp\Bigl\{ -\frac{c_5 \varepsilon b_1 (c_1 - b_1) q}{k c_2} \Bigr\} \tag{4.22} $$

We consider two sub-cases, (i) $b_1 \le c_1/2$ and (ii) $b_1 > c_1/2$. In case (i), we have that

$$ \frac{b_1 (c_1 - b_1)}{c_2} = \frac{1}{2} \frac{b_1 (c_1 - b_1)}{c_2} + \frac{1}{2} \frac{b_1 (c_1 - b_1)}{c_2} \ge \frac{1}{2} \frac{b_1 c_1}{2 c_2} + \frac{1}{2} \frac{b_2 (c_2 - b_2)}{c_2} \ge b_1/4 + \min\{ b_2, c_2 - b_2 \}/4 . $$

Hence, the RHS of (4.22) is bounded above by $\exp\{ -c_5 \varepsilon q\, (b_1/4 + \min\{ b_2, c_2 - b_2 \}/4) / k \}$. The
number of possibilities of $B_1, B_2$ of sizes $b_1, b_2$ is clearly at most $n^{b_1 + \min\{ b_2,\, c_2 - b_2 \}}$, hence
it suffices to take $q = O(\varepsilon^{-1} k \log n)$ for this case. In case (ii), we can upper bound
the RHS of (4.22) by $\exp\{ -c_5 \varepsilon c_1 (c_1 - b_1) q / (2 k c_2) \} \le \exp\{ -c_5 \varepsilon (c_1 - b_1) q / (2k) \}$. The number of
possibilities of $B_1, B_2$ of sizes $b_1, b_2$ is clearly at most $n^{(c_1 - b_1) + b_2}$ which, using our assumptions,
is bounded above by $n^{(c_1 - b_1) + (c_1 - b_1)/k} \le n^{2 (c_1 - b_1)}$. Hence, it suffices to take
$q = O(\varepsilon^{-1} k \log n)$ for this case.

This concludes the proof of the lemma. □ 

As a consequence, we get the following: 

Lemma 4.3. With probability at least $1 - n^{-3}$, the following holds simultaneously for all k-clusterings
$\mathcal{C}'$:

$$ |f(\mathcal{C}') - \hat{f}(\mathcal{C}')| \le 3 \varepsilon\, d(\mathcal{C}', \mathcal{C}) . $$



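Proof.

$$
\begin{aligned}
|f(\mathcal{C}') - \hat{f}(\mathcal{C}')|
&\le \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} |F_{i,j}(\mathcal{C}') - \hat{F}_{i,j}(\mathcal{C}')| + \sum_{j=1}^{k} \sum_{i_1 < i_2} |F_{i_1,i_2,j}(\mathcal{C}') - \hat{F}_{i_1,i_2,j}(\mathcal{C}')| \\
&\le \frac{\varepsilon}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} |C_{ij} \times (C_i \setminus C_{ij})| \\
&\quad + \varepsilon \sum_{j=1}^{k} \sum_{i_1 < i_2} \Bigl( |C_{i_1 j} \times C_{i_2 j}| + k^{-1} |C_{i_1 j} \times (C_{i_1} \setminus C_{i_1 j})| + k^{-1} |C_{i_2 j} \times (C_{i_2} \setminus C_{i_2 j})| \Bigr) \\
&= \frac{\varepsilon}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} |C_{ij} \times (C_i \setminus C_{ij})| + \varepsilon \sum_{j=1}^{k} \sum_{i_1 < i_2} |C_{i_1 j} \times C_{i_2 j}| + \varepsilon\, \frac{k-1}{k} \sum_{j=1}^{k} \sum_{i=1}^{k} |C_{ij} \times (C_i \setminus C_{ij})| \\
&\le \frac{3 \varepsilon}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} |C_{ij} \times (C_i \setminus C_{ij})| + \varepsilon \sum_{j=1}^{k} \sum_{i_1 < i_2} |C_{i_1 j} \times C_{i_2 j}| \\
&\le 3 \varepsilon\, d(\mathcal{C}', \mathcal{C}) .
\end{aligned}
$$

The first inequality follows from (4.4) and (4.5) together with the triangle inequality. The second follows from Lemmas 4.1 and 4.2
(assuming success of a high probability event, and bounding the maximum in (4.12) by the sum of its three arguments). The third and fourth
steps are rearrangements of the sum, using the fact that each index $i$ appears as $i_1$ or $i_2$ in exactly $k - 1$ of the pairs $i_1 < i_2$, and the
final inequality comes from (4.3). □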
5 Conclusions and Future Work



Our study considered the information theoretical problem of choosing which questions to ask in
a game in which adversarially noisy combinatorial pairwise information is input to a clustering
algorithm. We designed and analyzed a distribution from which drawing pairs is provably superior
to drawing from the uniform distribution. Our analysis did not take into account geometric information
(e.g. a feature vector attached to each data point) and treated the similarity labels as side information, as
suggested in a recent line of literature. It would be interesting to study our solution in conjunction
with geometric information. It would also be interesting to study our approach in the context of
metric learning, where the goal is to cleverly choose the pairs for which to obtain (noisy) distance
labels.